
Unet3d rcp fix v3.1 #327

Merged (7 commits, Sep 28, 2023)

Conversation

mmarcinkiewicz (Contributor)

UNET3D RCPs don't exhibit the typical convex curve on a plot of epochs to converge vs. GBS. Instead, their convergence is somewhat chaotic; this is a result of the dataset being very small and of training happening at GBS values that are a large fraction of the dataset (1/3 or even 1/2). Because of that, RCP pruning doesn't work as expected and can potentially pass questionable results. Here's the current plot of epochs to converge vs. GBS:
[Figure: epochs to converge vs. GBS, current RCPs]

My proposal is to replace epochs to converge with samples to converge (as BERT and the LLM benchmark currently do). To do that, I simply multiply the epochs to converge by samples per epoch. The new plot would look like:
[Figure: samples to converge vs. GBS, proposed]
The behaviour is much more stable; there are no visible dips for RCP pruning to malfunction on.
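
The conversion itself is just a per-GBS rescaling of the existing RCP data. A minimal sketch (the helper below is illustrative, not the RCP checker's actual API):

```python
def epochs_to_samples_to_converge(epochs_to_converge, samples_per_epoch):
    """Rescale an epochs-to-converge RCP value into samples to converge."""
    return epochs_to_converge * samples_per_epoch

# Example with the GBS=32 numbers from the commit message below:
# a mean of 2409 epochs at 160 samples per epoch becomes
# 2409 * 160 = 385,440 samples to converge.
print(epochs_to_samples_to_converge(2409, 160))  # 385440
```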

If the PR is approved, I'll prepare the appropriate changes to the rules and the reference.

mmarcinkiewicz and others added 5 commits May 9, 2023 12:46
In mlcommons/training#523 the distributed sampler was enabled, which required using the `drop_last_batch` setting instead of the `pad_last_batch` setting used previously. This changed `samples_per_epoch` for each global batch size and required a different set of RCPs for GBS values that do not divide the dataset evenly (the arithmetic is sketched after the results table below).

Furthermore, in mlcommons/training#625, a bug in the implementation of the distributed sampler was found, which might have affected convergence.

I regenerated all the RCPs (except for BS=2, which does not use the distributed sampler and divides the dataset evenly) to accommodate the changes.
The changes for GBS=32 and GBS=80 seem substantial, but please keep in mind that the number of samples is more or less the same, so the total time to train should remain constant:

| GBS | Old samples/epoch | New samples/epoch | Old RCP mean | New RCP mean | Ratio | Expected ratio |
|----:|------------------:|------------------:|-------------:|-------------:|------:|---------------:|
| 32  | 192 | 160 | 1974    | 2409    | 1.22 | 1.20 (192/160) |
| 56  | 168 | 168 | 2213    | 2300    | 1.04 | 1.00 |
| 64  | 192 | 128 | ---     | 3270.77 | ---  | 1.50 |
| 80  | 240 | 160 | 1618.75 | 2462.51 | 1.52 | 1.50 |
| 84  | --- | --- | 2233.25 | 2308.70 | 1.03 | 1.00 |

For GBS=64 there was no old RCP mean to compare against; the old mean implied by the expected 1.5 ratio is 2180.51.
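
For context, the expected ratios above follow directly from how switching `pad_last_batch` to `drop_last_batch` changes `samples_per_epoch`. A minimal sketch, assuming the 168-sample training set implied by the numbers above (the helper name and structure are illustrative):

```python
import math

DATASET_SIZE = 168  # training-set size implied by the samples-per-epoch values above

def samples_per_epoch(gbs, drop_last):
    """Samples seen per epoch for a given global batch size (GBS)."""
    if drop_last:
        # drop_last_batch: the partial final batch is discarded
        return (DATASET_SIZE // gbs) * gbs
    # pad_last_batch: the final batch is padded up to a full GBS
    return math.ceil(DATASET_SIZE / gbs) * gbs

for gbs in (32, 56, 64, 80, 84):
    old = samples_per_epoch(gbs, drop_last=False)  # pad_last_batch (before training#523)
    new = samples_per_epoch(gbs, drop_last=True)   # drop_last_batch (after training#523)
    # Fewer samples per epoch means proportionally more epochs to converge,
    # so the expected ratio of RCP means is old/new.
    print(f"GBS={gbs}: old={old}, new={new}, expected ratio={old / new:.2f}")

# GBS=32: old=192, new=160, expected ratio=1.20
# GBS=56: old=168, new=168, expected ratio=1.00
# GBS=64: old=192, new=128, expected ratio=1.50
# GBS=80: old=240, new=160, expected ratio=1.50
# GBS=84: old=168, new=168, expected ratio=1.00
```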
mmarcinkiewicz requested review from a team as code owners August 3, 2023 13:47
github-actions bot commented Aug 3, 2023

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

nv-rborkar (Contributor)

Training WG 08/17: generally accepted as OK in the meeting.

xyhuang previously approved these changes Sep 5, 2023
pgmpablo157321 (Contributor)

@mmarcinkiewicz Is this RCP update meant for training v3.1? In that case, could you update your branch and move the changes into the training-3.1.0 folder?

…to unet3d-rcp-fix-v3.1

# Conflicts:
#	mlperf_logging/rcp_checker/training_3.0.0/rcps_unet3d.json
nv-rborkar (Contributor)

@pgmpablo157321 The changes have been moved to the 3.1.0 folder. Please take a look.

pgmpablo157321 (Contributor) left a comment

LGTM

nv-rborkar merged commit c27d838 into mlcommons:master on Sep 28, 2023
1 check passed
github-actions bot locked and limited the conversation to collaborators Sep 28, 2023